The purpose of this case study is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

Project execution steps:

  1. Data pre-processing - understand the data and treat missing values and outliers (use box plots)
  2. Understanding the attributes - find relationships between the independent variables and choose carefully which attributes should be part of the analysis, and why
  3. Use PCA from scikit-learn and an elbow plot to find the reduced number of dimensions (covering more than 95% of the variance)
  4. Use support vector machines with grid search (try C values 0.01, 0.05, 0.5, 1 and kernel = linear, rbf) to find the best hyperparameters, and use cross-validation to estimate accuracy
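Step 4 above can be sketched in advance as follows. This is a minimal, self-contained sketch: it uses a synthetic dataset from `make_classification` as a stand-in for the PCA-reduced vehicle features, not the actual data loaded later in this notebook.

```python
# Sketch of step 4: grid search over SVC hyperparameters with cross-validation.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the PCA-reduced features (3 classes, like car/bus/van).
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# The C values and kernels named in the project brief.
param_grid = {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(gamma='scale'), param_grid, cv=5)
grid.fit(X, y)

print("Best params:", grid.best_params_)
print("Best CV accuracy: {:.3f}".format(grid.best_score_))
```

`GridSearchCV` refits the best estimator on the full data by default, so `grid` can afterwards be used directly for prediction.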
In [1]:
#import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,confusion_matrix
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
In [2]:
vehicle_df = pd.read_csv('/users/rajuhegde/desktop/Greatlearning/GLprojects/vehicle/vehicle.csv')
In [3]:
vehicle_df.head()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [4]:
print("The dataframe has {} rows and {} columns".format(vehicle_df.shape[0],vehicle_df.shape[1]))
The dataframe has 846 rows and 19 columns
In [5]:
vehicle_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB

From the above we can see that, except for the 'class' column, all columns are numeric, and some columns contain null values. The 'class' column is our target column.

In [6]:
#display in each column how many null values are there
vehicle_df.apply(lambda x: sum(x.isnull()))
Out[6]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

From the above we can see that the maximum number of null values in any column is 6, occurring in 'radius_ratio' and 'skewness_about'. We have two options: drop the rows with null values, or impute them. Dropping them is not ideal because we would lose information.
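The two options can be contrasted on a toy frame (the values below are made up for illustration, not taken from the vehicle data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'radius_ratio':   [141.0, np.nan, 195.0, 167.0],
                   'skewness_about': [2.0,   6.0,    np.nan, 9.0]})

dropped = df.dropna()             # option 1: lose every row with any NaN
imputed = df.fillna(df.median())  # option 2: keep all rows, fill with the median

print(len(df), len(dropped), len(imputed))  # 4 2 4
```

Dropping loses half the toy rows because each NaN discards a whole row; median imputation keeps all rows at the cost of slightly distorting the distribution.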

In [7]:
#display 5 point summary of dataframe
vehicle_df.describe().transpose()
Out[7]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0

From the above five-point summary it looks like we can impute with the median. Imputing missing values with the median does change the shape of the distribution and introduces some bias, but it is likely better than dropping the missing values.

In [8]:
sns.pairplot(vehicle_df,diag_kind='kde')
plt.show()
/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:448: RuntimeWarning: invalid value encountered in greater
  X = X[np.logical_and(X > clip[0], X < clip[1])] # won't work for two columns.
/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:448: RuntimeWarning: invalid value encountered in less
  X = X[np.logical_and(X > clip[0], X < clip[1])] # won't work for two columns.

From the above pair plots we can see that many columns are correlated, and many columns have long tails, which is an indication of outliers. Further down we will use the correlation matrix to measure the strength of the correlations, and check whether outliers are really present.

From the above we can see that our data has missing values in some columns, so before building any model we have to handle them. We have two options: drop those rows, or impute the missing values. Here we impute with the median, and then copy the result into a new dataframe. It is good practice to keep a reference dataframe intact and make subsequent modifications in the new one.

Imputing missing values

In [9]:
vehicle_df.fillna(vehicle_df.median(),axis=0,inplace=True)
In [10]:
new_vehicle_df = vehicle_df.copy()

We now have a new dataframe called new_vehicle_df, and we will carry out the analysis on it.

In [11]:
#display the first 5 rows of new dataframe
new_vehicle_df.head()
Out[11]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [12]:
new_vehicle_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    846 non-null float64
distance_circularity           846 non-null float64
radius_ratio                   846 non-null float64
pr.axis_aspect_ratio           846 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  846 non-null float64
elongatedness                  846 non-null float64
pr.axis_rectangularity         846 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                846 non-null float64
scaled_variance.1              846 non-null float64
scaled_radius_of_gyration      846 non-null float64
scaled_radius_of_gyration.1    846 non-null float64
skewness_about                 846 non-null float64
skewness_about.1               846 non-null float64
skewness_about.2               846 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [13]:
#display the shape of dataframe
print("Shape of newly created dataframe:",new_vehicle_df.shape)
Shape of newly created dataframe: (846, 19)
In [14]:
#display 5 point summary of new dataframe
new_vehicle_df.describe().transpose()
Out[14]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.823877 6.134272 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.100473 15.741569 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.874704 33.401356 104.0 141.00 167.0 195.00 333.0
pr.axis_aspect_ratio 846.0 61.677305 7.882188 47.0 57.00 61.0 65.00 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.00 55.0
scatter_ratio 846.0 168.887707 33.197710 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.936170 7.811882 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.580378 2.588558 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.596927 31.360427 130.0 167.00 179.0 217.00 320.0
scaled_variance.1 846.0 439.314421 176.496341 184.0 318.25 363.5 586.75 1018.0
scaled_radius_of_gyration 846.0 174.706856 32.546277 109.0 149.00 173.5 198.00 268.0
scaled_radius_of_gyration.1 846.0 72.443262 7.468734 59.0 67.00 71.5 75.00 135.0
skewness_about 846.0 6.361702 4.903244 0.0 2.00 6.0 9.00 22.0
skewness_about.1 846.0 12.600473 8.930962 0.0 5.00 11.0 19.00 41.0
skewness_about.2 846.0 188.918440 6.152247 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0

Analysis of each column with the help of plots

In [15]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['compactness'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['compactness'],ax=ax2)
ax2.set_title("Box Plot")
Out[15]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the compactness column, and it looks approximately normally distributed.

In [16]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['circularity'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['circularity'],ax=ax2)
ax2.set_title("Box Plot")
Out[16]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the circularity column, and it looks approximately normally distributed.

In [17]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['distance_circularity'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['distance_circularity'],ax=ax2)
ax2.set_title("Box Plot")
Out[17]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the distance_circularity column, but the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median).

In [18]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['radius_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['radius_ratio'],ax=ax2)
ax2.set_title("Box Plot")
Out[18]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are outliers in the radius_ratio column and right skewness, since the long tail is on the right side (mean > median).
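The mean > median rule of thumb for right skewness can be checked numerically with `scipy.stats.skew`; the sketch below uses a synthetic right-skewed sample (an exponential, loosely matched to radius_ratio's scale) rather than the actual column:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed stand-in: exponential tail shifted to radius_ratio's range.
sample = rng.exponential(scale=33.0, size=846) + 104.0

print("mean  :", sample.mean())
print("median:", np.median(sample))
print("skew  :", skew(sample))  # positive => long right tail
```

A positive skew coefficient agrees with the visual reading: the mean is pulled above the median by the long right tail.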

In [19]:
#check how many outliers there are in the radius_ratio column
q1 = np.quantile(new_vehicle_df['radius_ratio'],0.25)
q2 = np.quantile(new_vehicle_df['radius_ratio'],0.50)
q3 = np.quantile(new_vehicle_df['radius_ratio'],0.75)
IQR = q3 - q1
upper = q3 + 1.5 * IQR
print("Quartile1::",q1)
print("Quartile2::",q2)
print("Quartile3::",q3)
print("Inter Quartile Range::",IQR)
print("radius_ratio values above",upper,"are outliers")
print("Number of outliers in radius_ratio:",new_vehicle_df[new_vehicle_df['radius_ratio']>upper].shape[0])
Quartile1:: 141.0
Quartile2:: 167.0
Quartile3:: 195.0
Inter Quartile Range:: 54.0
radius_ratio values above 276.0 are outliers
Number of outliers in radius_ratio: 3
In [20]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['pr.axis_aspect_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['pr.axis_aspect_ratio'],ax=ax2)
ax2.set_title("Box Plot")
Out[20]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are outliers in the pr.axis_aspect_ratio column and right skewness, since the long tail is on the right side (mean > median).

In [21]:
#check how many outliers there are in the pr.axis_aspect_ratio column
q1 = np.quantile(new_vehicle_df['pr.axis_aspect_ratio'],0.25)
q2 = np.quantile(new_vehicle_df['pr.axis_aspect_ratio'],0.50)
q3 = np.quantile(new_vehicle_df['pr.axis_aspect_ratio'],0.75)
IQR = q3 - q1
upper = q3 + 1.5 * IQR
print("Quartile1::",q1)
print("Quartile2::",q2)
print("Quartile3::",q3)
print("Inter Quartile Range::",IQR)
print("pr.axis_aspect_ratio values above",upper,"are outliers")
print("Number of outliers in pr.axis_aspect_ratio:",new_vehicle_df[new_vehicle_df['pr.axis_aspect_ratio']>upper].shape[0])
Quartile1:: 57.0
Quartile2:: 61.0
Quartile3:: 65.0
Inter Quartile Range:: 8.0
pr.axis_aspect_ratio values above 77.0 are outliers
Number of outliers in pr.axis_aspect_ratio: 8
In [22]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['max.length_aspect_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['max.length_aspect_ratio'],ax=ax2)
ax2.set_title("Box Plot")
Out[22]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are outliers in the max.length_aspect_ratio column and right skewness, since the long tail is on the right side (mean > median).

In [23]:
#check how many outliers there are in the max.length_aspect_ratio column
q1 = np.quantile(new_vehicle_df['max.length_aspect_ratio'],0.25)
q2 = np.quantile(new_vehicle_df['max.length_aspect_ratio'],0.50)
q3 = np.quantile(new_vehicle_df['max.length_aspect_ratio'],0.75)
IQR = q3 - q1
upper = q3 + 1.5 * IQR
lower = q1 - 1.5 * IQR
print("Quartile1::",q1)
print("Quartile2::",q2)
print("Quartile3::",q3)
print("Inter Quartile Range::",IQR)
print("max.length_aspect_ratio values above",upper,"are outliers")
print("max.length_aspect_ratio values below",lower,"are outliers")
print("Outliers above the upper bound:",new_vehicle_df[new_vehicle_df['max.length_aspect_ratio']>upper].shape[0])
print("Outliers below the lower bound:",new_vehicle_df[new_vehicle_df['max.length_aspect_ratio']<lower].shape[0])
Quartile1:: 7.0
Quartile2:: 8.0
Quartile3:: 10.0
Inter Quartile Range:: 3.0
max.length_aspect_ratio values above 14.5 are outliers
max.length_aspect_ratio values below 2.5 are outliers
Outliers above the upper bound: 12
Outliers below the lower bound: 1
In [24]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scatter_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['scatter_ratio'],ax=ax2)
ax2.set_title("Box Plot")
Out[24]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the scatter_ratio column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median).

In [25]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['elongatedness'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['elongatedness'],ax=ax2)
ax2.set_title("Box Plot")
Out[25]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the elongatedness column; the distribution plot shows two peaks and left skewness, since the long tail is on the left side (mean < median).

In [26]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['pr.axis_rectangularity'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['pr.axis_rectangularity'],ax=ax2)
ax2.set_title("Box Plot")
Out[26]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the pr.axis_rectangularity column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median).

In [27]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['max.length_rectangularity'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['max.length_rectangularity'],ax=ax2)
ax2.set_title("Box Plot")
Out[27]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the max.length_rectangularity column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median).

In [28]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scaled_variance'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['scaled_variance'],ax=ax2)
ax2.set_title("Box Plot")
Out[28]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are outliers in the scaled_variance column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median).

In [29]:
#check how many outliers there are in the scaled_variance column
q1 = np.quantile(new_vehicle_df['scaled_variance'],0.25)
q2 = np.quantile(new_vehicle_df['scaled_variance'],0.50)
q3 = np.quantile(new_vehicle_df['scaled_variance'],0.75)
IQR = q3 - q1
upper = q3 + 1.5 * IQR
print("Quartile1::",q1)
print("Quartile2::",q2)
print("Quartile3::",q3)
print("Inter Quartile Range::",IQR)
print("scaled_variance values above",upper,"are outliers")
print("Number of outliers in scaled_variance:",new_vehicle_df[new_vehicle_df['scaled_variance']>upper].shape[0])
Quartile1:: 167.0
Quartile2:: 179.0
Quartile3:: 217.0
Inter Quartile Range:: 50.0
scaled_variance values above 292.0 are outliers
Number of outliers in scaled_variance: 1
In [30]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scaled_variance.1'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['scaled_variance.1'],ax=ax2)
ax2.set_title("Box Plot")
Out[30]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are outliers in the scaled_variance.1 column; the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median).

In [31]:
#check how many outliers there are in the scaled_variance.1 column
q1 = np.quantile(new_vehicle_df['scaled_variance.1'],0.25)
q2 = np.quantile(new_vehicle_df['scaled_variance.1'],0.50)
q3 = np.quantile(new_vehicle_df['scaled_variance.1'],0.75)
IQR = q3 - q1
upper = q3 + 1.5 * IQR
print("Quartile1::",q1)
print("Quartile2::",q2)
print("Quartile3::",q3)
print("Inter Quartile Range::",IQR)
print("scaled_variance.1 values above",upper,"are outliers")
print("Number of outliers in scaled_variance.1:",new_vehicle_df[new_vehicle_df['scaled_variance.1']>upper].shape[0])
Quartile1:: 318.25
Quartile2:: 363.5
Quartile3:: 586.75
Inter Quartile Range:: 268.5
scaled_variance.1 values above 989.5 are outliers
Number of outliers in scaled_variance.1: 2
In [32]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scaled_radius_of_gyration'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['scaled_radius_of_gyration'],ax=ax2)
ax2.set_title("Box Plot")
Out[32]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the scaled_radius_of_gyration column, and there is right skewness, since the long tail is on the right side (mean > median).

In [33]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scaled_radius_of_gyration.1'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['scaled_radius_of_gyration.1'],ax=ax2)
ax2.set_title("Box Plot")
Out[33]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are outliers in the scaled_radius_of_gyration.1 column, and there is right skewness, since the long tail is on the right side (mean > median).

In [34]:
#check how many outliers there are in the scaled_radius_of_gyration.1 column
q1 = np.quantile(new_vehicle_df['scaled_radius_of_gyration.1'],0.25)
q2 = np.quantile(new_vehicle_df['scaled_radius_of_gyration.1'],0.50)
q3 = np.quantile(new_vehicle_df['scaled_radius_of_gyration.1'],0.75)
IQR = q3 - q1
upper = q3 + 1.5 * IQR
print("Quartile1::",q1)
print("Quartile2::",q2)
print("Quartile3::",q3)
print("Inter Quartile Range::",IQR)
print("scaled_radius_of_gyration.1 values above",upper,"are outliers")
print("Number of outliers in scaled_radius_of_gyration.1:",new_vehicle_df[new_vehicle_df['scaled_radius_of_gyration.1']>upper].shape[0])
Quartile1:: 67.0
Quartile2:: 71.5
Quartile3:: 75.0
Inter Quartile Range:: 8.0
scaled_radius_of_gyration.1 values above 87.0 are outliers
Number of outliers in scaled_radius_of_gyration.1: 15
In [35]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['skewness_about'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['skewness_about'],ax=ax2)
ax2.set_title("Box Plot")
Out[35]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are outliers in the skewness_about column, and there is right skewness, since the long tail is on the right side (mean > median).

In [36]:
#check how many outliers there are in the skewness_about column
q1 = np.quantile(new_vehicle_df['skewness_about'],0.25)
q2 = np.quantile(new_vehicle_df['skewness_about'],0.50)
q3 = np.quantile(new_vehicle_df['skewness_about'],0.75)
IQR = q3 - q1
upper = q3 + 1.5 * IQR
print("Quartile1::",q1)
print("Quartile2::",q2)
print("Quartile3::",q3)
print("Inter Quartile Range::",IQR)
print("skewness_about values above",upper,"are outliers")
print("Number of outliers in skewness_about:",new_vehicle_df[new_vehicle_df['skewness_about']>upper].shape[0])
Quartile1:: 2.0
Quartile2:: 6.0
Quartile3:: 9.0
Inter Quartile Range:: 7.0
skewness_about values above 19.5 are outliers
Number of outliers in skewness_about: 12
In [37]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['skewness_about.1'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['skewness_about.1'],ax=ax2)
ax2.set_title("Box Plot")
Out[37]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are outliers in the skewness_about.1 column, and there is right skewness, since the long tail is on the right side (mean > median).

In [38]:
#check how many outliers there are in the skewness_about.1 column
q1 = np.quantile(new_vehicle_df['skewness_about.1'],0.25)
q2 = np.quantile(new_vehicle_df['skewness_about.1'],0.50)
q3 = np.quantile(new_vehicle_df['skewness_about.1'],0.75)
IQR = q3 - q1
upper = q3 + 1.5 * IQR
print("Quartile1::",q1)
print("Quartile2::",q2)
print("Quartile3::",q3)
print("Inter Quartile Range::",IQR)
print("skewness_about.1 values above",upper,"are outliers")
print("Number of outliers in skewness_about.1:",new_vehicle_df[new_vehicle_df['skewness_about.1']>upper].shape[0])
Quartile1:: 5.0
Quartile2:: 11.0
Quartile3:: 19.0
Inter Quartile Range:: 14.0
skewness_about.1 values above 40.0 are outliers
Number of outliers in skewness_about.1: 3
In [39]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['skewness_about.2'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['skewness_about.2'],ax=ax2)
ax2.set_title("Box Plot")
Out[39]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the skewness_about.2 column; the distribution looks roughly symmetric (mean 188.9 vs median 188), with at most a slight skew.

In [40]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['hollows_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(new_vehicle_df['hollows_ratio'],ax=ax2)
ax2.set_title("Box Plot")
Out[40]:
Text(0.5, 1.0, 'Box Plot')

From the above we can see that there are no outliers in the hollows_ratio column, and there is left skewness, since the long tail is on the left side (mean < median).

In [41]:
#display how many are car,bus,van. 
new_vehicle_df['class'].value_counts()
Out[41]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [42]:
sns.countplot(new_vehicle_df['class'])
plt.show()

From the above we can see that cars are the most frequent class, followed by buses and then vans.

By now we have analyzed each column and found outliers in some of them. Our next step is to decide whether these outliers are natural or artificial: if they are natural we leave them alone, but if they are artificial we have to handle them. We found outliers in 8 columns: radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1, skewness_about and skewness_about.1.

Looking at the maximum values in those columns, the outliers appear to be natural rather than typos or artifacts. Note: this is only an assumption, as there is no way to prove from the data alone whether they are natural or artificial. Most algorithms are affected by outliers, and since we will apply an SVM to this data, it is better to drop them.
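The per-column outlier checks above all repeat the same IQR arithmetic, so they could be folded into a small helper like the sketch below (the function names are mine, not from the notebook):

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(s: pd.Series, k: float = 1.5) -> int:
    """Count values falling outside the Tukey fences."""
    lo, hi = iqr_bounds(s, k)
    return int(((s < lo) | (s > hi)).sum())

# Toy check with one obvious outlier (hypothetical values).
s = pd.Series([141, 167, 195, 150, 160, 170, 1000])
lo, hi = iqr_bounds(s)
print(lo, hi, count_outliers(s))
```

With a helper like this, each column's check becomes a one-liner, e.g. `count_outliers(new_vehicle_df['radius_ratio'])`.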

Fixing outliers after imputing missing values

In [43]:
impute_vehicle_df = new_vehicle_df.copy()
In [44]:
#radius_ratio column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['radius_ratio']>276].index,axis=0,inplace=True)
In [45]:
#pr.axis_aspect_ratio column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['pr.axis_aspect_ratio']>77].index,axis=0,inplace=True)
In [46]:
#max.length_aspect_ratio column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['max.length_aspect_ratio']>14.5].index,axis=0,inplace=True)
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['max.length_aspect_ratio']<2.5].index,axis=0,inplace=True)
In [47]:
#scaled_variance column outliers
impute_vehicle_df[impute_vehicle_df['scaled_variance']>292]
Out[47]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class

The empty result above shows that no rows exceed the scaled_variance threshold any more: those outliers were already removed by the earlier drops.

In [48]:
#scaled_variance.1 column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['scaled_variance.1']>989.5].index,axis=0,inplace=True)
In [49]:
#scaled_radius_of_gyration.1 column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['scaled_radius_of_gyration.1']>87].index,axis=0,inplace=True)
In [50]:
#skewness_about column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['skewness_about']>19.5].index,axis=0,inplace=True)
In [51]:
#skewness_about.1 column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['skewness_about.1']>40].index,axis=0,inplace=True)
In [52]:
#display the shape of data frame
print("after fixing outliers shape of dataframe:",impute_vehicle_df.shape)
after fixing outliers shape of dataframe: (813, 19)
In [53]:
#find the correlation between independent variables
plt.figure(figsize=(20,5))
sns.heatmap(new_vehicle_df.corr(),annot=True)
plt.show()

Our objective is to recognize whether an object is a van, a bus or a car based on the input features, and a key assumption is that there is little or no multicollinearity between those features. If two features are highly correlated there is no benefit in using both, and we can drop one of them. The heatmap above gives us the correlation matrix, where we can see which features are highly correlated. Looking carefully, scaled_variance.1 and scatter_ratio have a correlation of 1, and several other pairs exceed 0.9, so we will drop columns involved in correlations of ±0.9 or above. There are 8 such columns: max.length_rectangularity, scaled_radius_of_gyration, skewness_about.2, scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance and scaled_variance.1.
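Picking out such columns by eye is error-prone; a common pattern is to scan the upper triangle of the absolute correlation matrix programmatically. A sketch on a toy frame (the helper name and data are mine, not from the notebook):

```python
import numpy as np
import pandas as pd

def high_corr_columns(df: pd.DataFrame, threshold: float = 0.9):
    """Columns whose |correlation| with an earlier column is >= threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is counted once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] >= threshold).any()]

# Toy frame: 'b' is almost a linear function of 'a'; 'c' is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({'a': a,
                   'b': 2 * a + rng.normal(scale=0.01, size=100),
                   'c': rng.normal(size=100)})
print(high_corr_columns(df))  # ['b']
```

Applied to new_vehicle_df with threshold 0.9, a helper like this would recover a list similar to the eight columns named above.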

Again we have two options: drop those eight columns manually, or apply PCA and let it decide how to explain this high-dimensional data with a smaller number of variables. We will look at both approaches.

Principal Component Analysis (PCA) is an unsupervised statistical technique used to explain high-dimensional data with a small number of variables called principal components. Each principal component is a linear combination of the original variables in the dataset. The big disadvantage is interpretability: because the components mix the original variables, a model built on them is effectively a black box. The procedure is as follows: first compute the covariance matrix; then find its eigenvectors and eigenvalues (each eigenvector has a corresponding eigenvalue); finally, sort the eigenvectors by decreasing eigenvalue and keep the k eigenvectors with the largest eigenvalues.
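The steps just described can be reproduced with NumPy alone. A sketch on random data (not the vehicle set), standardising, building the covariance matrix, and sorting the eigenpairs:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # make two columns correlated

Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardise (zscore)
cov = np.cov(Xs, rowvar=False)                   # covariance matrix
vals, vecs = np.linalg.eigh(cov)                 # eigh: for symmetric matrices
order = np.argsort(vals)[::-1]                   # sort by decreasing eigenvalue
vals, vecs = vals[order], vecs[:, order]

explained_ratio = vals / vals.sum()              # analogue of explained_variance_ratio_
scores = Xs @ vecs[:, :2]                        # project onto the top-2 components
```

scikit-learn's `PCA` performs an equivalent decomposition internally (via SVD), which is why it is used in the cells below instead of doing this by hand.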

With Principal Component Analysis(PCA)

In [54]:
#now separate the dataframe into dependent and independent variables
impute_vehicle_df_independent_attr = impute_vehicle_df.drop('class',axis=1)
impute_vehicle_df_dependent_attr = impute_vehicle_df['class']
print("shape of impute_vehicle_df_independent_attr::",impute_vehicle_df_independent_attr.shape)
print("shape of impute_vehicle_df_dependent_attr::",impute_vehicle_df_dependent_attr.shape)
shape of impute_vehicle_df_independent_attr:: (813, 18)
shape of impute_vehicle_df_dependent_attr:: (813,)
In [55]:
#now scale the independent attributes and encode the dependent attribute as numbers
impute_vehicle_df_independent_attr_scaled = impute_vehicle_df_independent_attr.apply(zscore)
impute_vehicle_df_dependent_attr.replace({'car':0,'bus':1,'van':2},inplace=True)
In [56]:
#build the covariance matrix; with 18 independent features, our covariance matrix is 18x18
impute_cov_matrix = np.cov(impute_vehicle_df_independent_attr_scaled,rowvar=False)
print("Impute cov_matrix shape:",impute_cov_matrix.shape)
print("Impute Covariance_matrix",impute_cov_matrix)
Impute cov_matrix shape: (18, 18)
Impute Covariance_matrix [[ 1.00123153e+00  6.80164027e-01  7.87792814e-01  7.46906930e-01
   2.00881439e-01  4.98273207e-01  8.11840645e-01 -7.89531434e-01
   8.12866245e-01  6.74996601e-01  7.92438680e-01  8.13494150e-01
   5.78399755e-01 -2.53990635e-01  2.00887113e-01  1.61304844e-01
   2.95777412e-01  3.64608943e-01]
 [ 6.80164027e-01  1.00123153e+00  7.87747162e-01  6.41725205e-01
   2.06409699e-01  5.64854067e-01  8.44804611e-01 -8.16768295e-01
   8.41196310e-01  9.62404205e-01  8.03750964e-01  8.33508154e-01
   9.26281607e-01  6.67790806e-02  1.40563881e-01 -1.43598307e-02
  -1.16976151e-01  3.92302597e-02]
 [ 7.87792814e-01  7.87747162e-01  1.00123153e+00  8.09326627e-01
   2.45756551e-01  6.69657073e-01  9.06692225e-01 -9.09806087e-01
   8.95884623e-01  7.69635504e-01  8.85221631e-01  8.89286924e-01
   7.03348558e-01 -2.38231284e-01  9.89345733e-02  2.63832735e-01
   1.29070982e-01  3.22051625e-01]
 [ 7.46906930e-01  6.41725205e-01  8.09326627e-01  1.00123153e+00
   6.67029240e-01  4.61258592e-01  7.90495472e-01 -8.45064567e-01
   7.64769672e-01  5.77501217e-01  7.93778346e-01  7.77097647e-01
   5.51222677e-01 -4.03672885e-01  4.03555670e-02  1.87420711e-01
   4.18869167e-01  5.05314324e-01]
 [ 2.00881439e-01  2.06409699e-01  2.45756551e-01  6.67029240e-01
   1.00123153e+00  1.38431761e-01  2.00217560e-01 -3.02289321e-01
   1.69961019e-01  1.46036511e-01  2.15074904e-01  1.86526180e-01
   1.53697623e-01 -3.25502385e-01 -5.16026240e-02 -2.86185855e-02
   4.06792617e-01  4.20318003e-01]
 [ 4.98273207e-01  5.64854067e-01  6.69657073e-01  4.61258592e-01
   1.38431761e-01  1.00123153e+00  4.98078976e-01 -5.02996017e-01
   4.97845069e-01  6.48642021e-01  4.12068816e-01  4.58456162e-01
   4.04786322e-01 -3.33161873e-01  8.41082601e-02  1.41145578e-01
   5.64852182e-02  3.94934461e-01]
 [ 8.11840645e-01  8.44804611e-01  9.06692225e-01  7.90495472e-01
   2.00217560e-01  4.98078976e-01  1.00123153e+00 -9.73537513e-01
   9.90659730e-01  8.08063766e-01  9.78751548e-01  9.94204811e-01
   7.95893849e-01  2.44702588e-03  6.35490363e-02  2.14445853e-01
  -3.10409338e-03  1.16323654e-01]
 [-7.89531434e-01 -8.16768295e-01 -9.09806087e-01 -8.45064567e-01
  -3.02289321e-01 -5.02996017e-01 -9.73537513e-01  1.00123153e+00
  -9.51112661e-01 -7.70982661e-01 -9.66090990e-01 -9.56973892e-01
  -7.63345981e-01  8.70842667e-02 -4.55135596e-02 -1.84181395e-01
  -1.05393355e-01 -2.11345600e-01]
 [ 8.12866245e-01  8.41196310e-01  8.95884623e-01  7.64769672e-01
   1.69961019e-01  4.97845069e-01  9.90659730e-01 -9.51112661e-01
   1.00123153e+00  8.11346565e-01  9.64981168e-01  9.88989478e-01
   7.93172901e-01  1.77904437e-02  7.28156271e-02  2.16892797e-01
  -2.65026808e-02  9.80719286e-02]
 [ 6.74996601e-01  9.62404205e-01  7.69635504e-01  5.77501217e-01
   1.46036511e-01  6.48642021e-01  8.08063766e-01 -7.70982661e-01
   8.11346565e-01  1.00123153e+00  7.50600479e-01  7.95049173e-01
   8.68007898e-01  5.26495142e-02  1.34795631e-01 -2.44448372e-03
  -1.17812145e-01  6.72596198e-02]
 [ 7.92438680e-01  8.03750964e-01  8.85221631e-01  7.93778346e-01
   2.15074904e-01  4.12068816e-01  9.78751548e-01 -9.66090990e-01
   9.64981168e-01  7.50600479e-01  1.00123153e+00  9.76750881e-01
   7.81984129e-01  1.68621531e-02  3.39888849e-02  2.05971428e-01
   2.28035846e-02  9.60435931e-02]
 [ 8.13494150e-01  8.33508154e-01  8.89286924e-01  7.77097647e-01
   1.86526180e-01  4.58456162e-01  9.94204811e-01 -9.56973892e-01
   9.88989478e-01  7.95049173e-01  9.76750881e-01  1.00123153e+00
   7.90805725e-01  1.62348310e-02  6.49567636e-02  2.03838067e-01
   7.85566308e-05  1.03330899e-01]
 [ 5.78399755e-01  9.26281607e-01  7.03348558e-01  5.51222677e-01
   1.53697623e-01  4.04786322e-01  7.95893849e-01 -7.63345981e-01
   7.93172901e-01  8.68007898e-01  7.81984129e-01  7.90805725e-01
   1.00123153e+00  2.16651698e-01  1.68973862e-01 -5.83635746e-02
  -2.32617810e-01 -1.20727281e-01]
 [-2.53990635e-01  6.67790806e-02 -2.38231284e-01 -4.03672885e-01
  -3.25502385e-01 -3.33161873e-01  2.44702588e-03  8.70842667e-02
   1.77904437e-02  5.26495142e-02  1.68621531e-02  1.62348310e-02
   2.16651698e-01  1.00123153e+00 -5.93373719e-02 -1.31142620e-01
  -8.43627948e-01 -9.18420730e-01]
 [ 2.00887113e-01  1.40563881e-01  9.89345733e-02  4.03555670e-02
  -5.16026240e-02  8.41082601e-02  6.35490363e-02 -4.55135596e-02
   7.28156271e-02  1.34795631e-01  3.39888849e-02  6.49567636e-02
   1.68973862e-01 -5.93373719e-02  1.00123153e+00 -4.53538836e-02
   8.48972195e-02  6.12111362e-02]
 [ 1.61304844e-01 -1.43598307e-02  2.63832735e-01  1.87420711e-01
  -2.86185855e-02  1.41145578e-01  2.14445853e-01 -1.84181395e-01
   2.16892797e-01 -2.44448372e-03  2.05971428e-01  2.03838067e-01
  -5.83635746e-02 -1.31142620e-01 -4.53538836e-02  1.00123153e+00
   7.28908031e-02  2.00156475e-01]
 [ 2.95777412e-01 -1.16976151e-01  1.29070982e-01  4.18869167e-01
   4.06792617e-01  5.64852182e-02 -3.10409338e-03 -1.05393355e-01
  -2.65026808e-02 -1.17812145e-01  2.28035846e-02  7.85566308e-05
  -2.32617810e-01 -8.43627948e-01  8.48972195e-02  7.28908031e-02
   1.00123153e+00  8.91041674e-01]
 [ 3.64608943e-01  3.92302597e-02  3.22051625e-01  5.05314324e-01
   4.20318003e-01  3.94934461e-01  1.16323654e-01 -2.11345600e-01
   9.80719286e-02  6.72596198e-02  9.60435931e-02  1.03330899e-01
  -1.20727281e-01 -9.18420730e-01  6.12111362e-02  2.00156475e-01
   8.91041674e-01  1.00123153e+00]]
In [57]:
#now find the eigenvalues and eigenvectors of the covariance matrix above (PCA does this decomposition for us)
impute_pca_to_learn_variance = PCA(n_components=18)
impute_pca_to_learn_variance.fit(impute_vehicle_df_independent_attr_scaled)
Out[57]:
PCA(copy=True, iterated_power='auto', n_components=18, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [58]:
#display explained variance ratio
impute_pca_to_learn_variance.explained_variance_ratio_
Out[58]:
array([5.43385012e-01, 1.87386253e-01, 6.70690992e-02, 6.30665320e-02,
       4.97324675e-02, 3.65268566e-02, 1.79255090e-02, 1.25904175e-02,
       6.25572293e-03, 4.22850947e-03, 3.43129149e-03, 2.45487103e-03,
       1.66416799e-03, 1.48558789e-03, 1.07943424e-03, 9.83188815e-04,
       5.61620004e-04, 1.73459006e-04])
In [59]:
#display explained variance
impute_pca_to_learn_variance.explained_variance_
Out[59]:
array([9.79297570e+00, 3.37710644e+00, 1.20873054e+00, 1.13659560e+00,
       8.96286859e-01, 6.58293128e-01, 3.23056525e-01, 2.26906613e-01,
       1.12741686e-01, 7.62069059e-02, 6.18393099e-02, 4.42420969e-02,
       2.99919142e-02, 2.67735138e-02, 1.94537446e-02, 1.77191935e-02,
       1.01216098e-02, 3.12610726e-03])
In [60]:
#display principal components
impute_pca_to_learn_variance.components_
Out[60]:
array([[ 2.72251046e-01,  2.85370045e-01,  3.01486231e-01,
         2.72594510e-01,  9.85797647e-02,  1.94755787e-01,
         3.10518442e-01, -3.08438338e-01,  3.07548493e-01,
         2.76301073e-01,  3.02748114e-01,  3.07040626e-01,
         2.61520489e-01, -4.36323635e-02,  3.67057041e-02,
         5.88504115e-02,  3.48373860e-02,  8.28136172e-02],
       [-8.97284818e-02,  1.33173937e-01, -4.40259591e-02,
        -2.04232234e-01, -2.59136858e-01, -9.45756320e-02,
         7.23350799e-02, -1.16876769e-02,  8.40915278e-02,
         1.25836631e-01,  7.01998575e-02,  7.79336637e-02,
         2.09927277e-01,  5.03914450e-01, -1.45682524e-02,
        -9.33980545e-02, -5.01664210e-01, -5.06546563e-01],
       [-2.26045073e-02, -2.10809943e-01,  7.08780817e-02,
         4.02139629e-02, -1.14805227e-01, -1.39313484e-01,
         1.12924698e-01, -9.00330455e-02,  1.11063547e-01,
        -2.19877688e-01,  1.44818765e-01,  1.15323952e-01,
        -2.13627435e-01,  6.73920886e-02, -5.21623444e-01,
         6.87170643e-01, -6.22069465e-02, -4.08035393e-02],
       [-1.30419032e-01,  2.06785531e-02, -1.07425217e-01,
         2.52957341e-01,  6.05228001e-01, -3.22531411e-01,
         1.00540370e-02, -7.99117560e-02, -1.60464922e-02,
        -6.66507863e-02,  6.98045095e-02,  1.73631584e-02,
         7.22457181e-02,  1.35860558e-01, -4.90121679e-01,
        -3.80232477e-01,  3.55391597e-02, -1.03008417e-01],
       [ 1.52324139e-01, -1.39022591e-01, -8.07335409e-02,
         1.19012554e-01,  8.32128223e-02, -6.21376071e-01,
         8.12405608e-02, -7.47379231e-02,  7.75020996e-02,
        -2.46140560e-01,  1.49584067e-01,  1.15117310e-01,
        -7.54871674e-03,  1.40527774e-01,  5.89800103e-01,
         1.27793729e-01,  1.81582693e-01, -1.11256244e-01],
       [ 2.58374578e-01, -6.88979940e-02, -2.04800896e-02,
        -1.39449676e-01, -5.87145492e-01, -2.65624695e-01,
         8.93335163e-02, -7.25853857e-02,  9.60554272e-02,
        -6.35014904e-02,  1.34458896e-01,  1.26968672e-01,
        -7.33961842e-02, -1.31928871e-01, -3.12415086e-01,
        -4.82506903e-01,  2.75222340e-01,  6.05771535e-02],
       [ 1.88794221e-01, -3.90871235e-01,  1.76384547e-01,
         1.56474448e-01,  1.02492950e-01,  3.98851794e-01,
         9.14237336e-02, -1.04875746e-01,  9.06723384e-02,
        -3.49667685e-01,  7.54753072e-02,  6.99641470e-02,
        -4.55851958e-01,  7.90311042e-02,  1.30187397e-01,
        -3.10629290e-01, -2.59557864e-01, -1.76348774e-01],
       [ 7.71578238e-01,  6.60528436e-02, -2.98693883e-01,
        -5.20410402e-02,  1.61872497e-01,  5.85800952e-02,
        -8.45300921e-02,  2.16815347e-01, -3.37069994e-02,
         2.26684736e-01, -1.45772665e-01, -5.32611781e-02,
        -1.58194670e-01,  3.00374428e-01, -1.14687509e-01,
         1.18168951e-01,  7.27008273e-02, -1.81034286e-02],
       [ 3.61784776e-01,  4.62957583e-02,  2.64499195e-01,
         1.70430331e-01, -1.17212341e-02, -1.73213170e-01,
        -1.37499298e-01,  2.59988735e-01, -1.03269951e-01,
        -2.44776407e-01, -5.85239946e-02, -1.28904560e-01,
         3.37170589e-01, -5.01365221e-01, -7.50393829e-02,
         3.07213623e-02, -3.62122453e-01, -2.40710780e-01],
       [-1.25233628e-01,  2.40262612e-01, -9.42971834e-02,
         8.97062530e-02,  2.87528583e-02, -2.49937617e-01,
         1.11244025e-01,  1.24837047e-01,  2.11468012e-01,
         3.87473859e-01, -1.47036092e-01,  1.60305310e-01,
        -5.87690102e-01, -3.87030017e-01,  5.41502565e-02,
        -1.36044539e-02, -2.20343289e-01, -1.71416688e-01],
       [-2.92009470e-02, -7.29503235e-02, -7.78755026e-01,
         1.31647081e-01, -4.97534613e-02,  1.98444456e-01,
         1.61642905e-01,  4.29365477e-03,  2.40841717e-01,
        -2.24580349e-01, -2.06902072e-02,  1.96322990e-01,
         2.58436921e-01, -2.27875444e-01,  1.39861362e-02,
         1.77010708e-02, -1.73696003e-01,  7.22825606e-02],
       [-7.62442008e-04, -1.93799916e-01,  2.32649049e-01,
        -2.75143903e-01,  1.45558629e-01, -1.72600201e-01,
         8.22439493e-02,  3.50089602e-01,  3.42527317e-01,
        -3.05154380e-02, -2.33368955e-01,  2.75169550e-01,
         1.07063554e-01,  1.38958435e-01, -5.61401152e-03,
        -8.59021362e-02, -2.79657886e-01,  5.36171185e-01],
       [-1.01407495e-01, -3.11337823e-01,  5.89166755e-02,
        -2.04574984e-01,  1.50893891e-01,  1.76055013e-01,
        -1.51805844e-02,  4.61164909e-01,  2.18872117e-01,
         1.53765067e-01,  1.79499013e-01,  2.20362642e-01,
         1.43753708e-01, -1.34656976e-01, -1.37166771e-02,
         2.72433694e-02,  4.14581122e-01, -4.65683959e-01],
       [ 1.46326861e-01, -1.96463651e-01, -5.33931974e-02,
        -6.58916577e-01,  2.89610835e-01, -6.68511988e-02,
         7.66778803e-02, -5.23226723e-01, -2.39504315e-02,
         1.04419937e-01, -1.16604375e-02, -7.99305617e-02,
         5.21969873e-02, -3.04769192e-01,  4.76724453e-03,
         2.97178011e-02, -1.14797284e-01, -8.53480643e-02],
       [-3.32992130e-03, -5.83996136e-01, -8.64160083e-02,
         2.71300494e-01, -9.64017331e-02, -1.10841470e-01,
        -8.33248999e-02,  1.36447171e-02, -1.72817545e-01,
         5.43122947e-01,  3.24937516e-01, -1.42051799e-01,
         8.32177228e-02, -3.01217731e-02,  2.14301813e-02,
        -1.83842486e-02, -2.41026732e-01,  1.78387852e-01],
       [-3.81638532e-03, -2.96230720e-01,  9.72735293e-02,
         2.74900989e-01, -1.19100067e-01, -2.92959443e-02,
         5.60355480e-02, -2.65096114e-01,  2.70709305e-01,
         1.53673085e-01, -7.26163025e-01, -1.22815848e-01,
         1.69567965e-01,  5.39469506e-02, -3.27151282e-02,
         1.82173722e-02,  1.66961820e-01, -1.96223612e-01],
       [ 1.05983722e-02, -8.71766559e-02,  2.28724292e-02,
         2.90668794e-02, -9.40948646e-03,  1.20980507e-02,
         2.72442207e-01,  2.61394487e-03, -6.84892390e-01,
         4.47385929e-02, -2.54510995e-01,  6.13103868e-01,
         4.41891377e-02, -1.59765660e-02, -5.03222786e-03,
         1.10992435e-02,  7.76499049e-03, -4.78049584e-02],
       [-1.06680587e-02, -7.74670931e-03,  1.11905744e-02,
        -3.74689248e-02,  2.09842091e-02, -1.06888298e-02,
         8.37148260e-01,  2.42295907e-01, -9.86931593e-02,
        -1.40549391e-02,  1.43866319e-02, -4.75672122e-01,
         8.61256926e-03,  7.55464886e-03, -2.19811008e-03,
        -1.39575997e-02,  3.82401827e-02,  3.98716359e-03]])
In [61]:
plt.bar(list(range(1,19)),impute_pca_to_learn_variance.explained_variance_ratio_)
plt.xlabel("eigen value/components")
plt.ylabel("variation explained")
plt.show()
In [62]:
plt.step(list(range(1,19)),np.cumsum(impute_pca_to_learn_variance.explained_variance_ratio_))
plt.xlabel("eigen value/components")
plt.ylabel("cumulative variation explained")
plt.show()

From the plots above we can see that 8 dimensions explain more than 95% of the variance in the data, so we will use the first 8 principal components.
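The component count can also be derived programmatically from the ratios printed in Out[58] (values copied from above, rounded). Note that by these numbers the 95% mark is in fact already crossed at 7 components, so keeping 8 clears the threshold comfortably:

```python
import numpy as np

# first 8 entries of explained_variance_ratio_ from Out[58], rounded
ratios = np.array([0.5434, 0.1874, 0.0671, 0.0631,
                   0.0497, 0.0365, 0.0179, 0.0126])
cum = np.cumsum(ratios)
k = int(np.argmax(cum >= 0.95)) + 1   # smallest k whose cumulative ratio reaches 95%
print(k)                              # 7
```

scikit-learn can do this selection directly: passing a float, as in `PCA(n_components=0.95)`, keeps just enough components to explain 95% of the variance.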

In [63]:
#use first 8 principal components
impute_pca_eight_components = PCA(n_components=8)
impute_pca_eight_components.fit(impute_vehicle_df_independent_attr_scaled)
Out[63]:
PCA(copy=True, iterated_power='auto', n_components=8, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [64]:
#transform the imputed data from 18 dimensions into 8 PCA dimensions
impute_vehicle_df_pca_independent_attr = impute_pca_eight_components.transform(impute_vehicle_df_independent_attr_scaled)
In [65]:
#display the shape of impute_vehicle_df_pca_independent_attr
impute_vehicle_df_pca_independent_attr.shape
Out[65]:
(813, 8)

Before settling on PCA with 8 dimensions (which explain more than 95% of the variation in the data), we will first build a model on the raw data, then build a model on the PCA data, and compare the two models.
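One caveat with the workflow above: the PCA is fitted on all 813 rows before the train/test split, so a little information from the test rows leaks into the components. A leak-free variant fits scaling and PCA inside a `Pipeline` on the training fold only. A sketch on a synthetic stand-in so it runs on its own (swap in the vehicle features and labels to use it here):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the 18 vehicle features and 3 class labels
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)

pipe = Pipeline([('scale', StandardScaler()),
                 ('pca', PCA(n_components=8)),
                 ('svc', SVC(C=0.5, kernel='linear'))])
pipe.fit(X_train, y_train)            # scaler and PCA see only the training fold
test_acc = pipe.score(X_test, y_test)
```

The test fold is then transformed with the training-fold statistics automatically inside `score`, which is the honest way to compare the raw and PCA models.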

In [66]:
#now split the data into 80:20 ratio
impute_rawdata_X_train,impute_rawdata_X_test,impute_rawdata_y_train,impute_rawdata_y_test = train_test_split(impute_vehicle_df_independent_attr_scaled,impute_vehicle_df_dependent_attr,test_size=0.20,random_state=1)
impute_pca_X_train,impute_pca_X_test,impute_pca_y_train,impute_pca_y_test = train_test_split(impute_vehicle_df_pca_independent_attr,impute_vehicle_df_dependent_attr,test_size=0.20,random_state=1)
In [67]:
print("shape of impute_rawdata_X_train",impute_rawdata_X_train.shape)
print("shape of impute_rawdata_y_train",impute_rawdata_y_train.shape)
print("shape of impute_rawdata_X_test",impute_rawdata_X_test.shape)
print("shape of impute_rawdata_y_test",impute_rawdata_y_test.shape)
print("--------------------------------------------")
print("shape of impute_pca_X_train",impute_pca_X_train.shape)
print("shape of impute_pca_y_train",impute_pca_y_train.shape)
print("shape of impute_pca_X_test",impute_pca_X_test.shape)
print("shape of impute_pca_y_test",impute_pca_y_test.shape)
shape of impute_rawdata_X_train (650, 18)
shape of impute_rawdata_y_train (650,)
shape of impute_rawdata_X_test (163, 18)
shape of impute_rawdata_y_test (163,)
--------------------------------------------
shape of impute_pca_X_train (650, 8)
shape of impute_pca_y_train (650,)
shape of impute_pca_X_test (163, 8)
shape of impute_pca_y_test (163,)

Use support vector machines with a grid search (try C values 0.01, 0.05, 0.5, 1 and kernels linear, rbf), find the best hyperparameters, and use cross-validation to find the accuracy.
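The cells below try each (C, kernel) pair by hand. The same search can be done in one step with `GridSearchCV`, which also handles the cross-validation. A sketch on synthetic data (swap in `impute_rawdata_X_train`/`impute_rawdata_y_train` to run it on the vehicle features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic stand-in for the scaled vehicle features and labels
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=1)

param_grid = {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(gamma='scale'), param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)   # best (C, kernel) combination found
print(grid.best_score_)    # its mean 5-fold cross-validated accuracy
```

`grid.best_estimator_` is refit on the full training data and can be used directly for prediction.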

SVM - C value 0.01 and kernel - Linear

In [68]:
from sklearn.svm import SVC
svc = SVC(C=0.01,kernel='linear', degree=3)
In [69]:
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
Out[69]:
SVC(C=0.01, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [70]:
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
In [71]:
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
Out[71]:
SVC(C=0.01, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [72]:
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
In [73]:
accrd_001_linear =accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict)
accpca_001_linear =accuracy_score(impute_pca_y_test,impute_pca_y_predict)
print(accrd_001_linear)
print(accpca_001_linear)
0.8957055214723927
0.8220858895705522
In [74]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.8957055214723927
Accuracy score with impute pca data(8 dimension) 0.8220858895705522
In [75]:
#display confusion matrix of both models
print("Confusion matrix with impute raw data(18 dimension)\n",confusion_matrix(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Confusion matrix with impute pca data(8 dimension)\n",confusion_matrix(impute_pca_y_test,impute_pca_y_predict))
Confusion matrix with impute raw data(18 dimension)
 [[80  3  1]
 [ 9 43  1]
 [ 2  1 23]]
Confusion matrix with impute pca data(8 dimension)
 [[76  6  2]
 [17 35  1]
 [ 1  2 23]]
In [76]:
resultsDf = pd.DataFrame({'Method':['C-001 & linear'], 'Accuracy score with impute raw data(18 dimension)':accrd_001_linear , 'Accuracy score with impute pca data(8 dimension)' :accpca_001_linear })
resultsDf = resultsDf[['Method', 'Accuracy score with impute raw data(18 dimension)','Accuracy score with impute pca data(8 dimension)']]
resultsDf
Out[76]:
Method Accuracy score with impute raw data(18 dimension) Accuracy score with impute pca data(8 dimension)
0 C-001 & linear 0.895706 0.822086
In [77]:
# SVM - C value 0.05 and kernel - Linear
In [78]:
svc = SVC(C=0.05,kernel='linear', degree=3)
In [79]:
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
Out[79]:
SVC(C=0.05, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [80]:
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
In [81]:
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
Out[81]:
SVC(C=0.05, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [82]:
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
In [83]:
accrd_005_linear =accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict)
accpca_005_linear =accuracy_score(impute_pca_y_test,impute_pca_y_predict)
print(accrd_005_linear)
print(accpca_005_linear)
0.9325153374233128
0.8650306748466258
In [84]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.9325153374233128
Accuracy score with impute pca data(8 dimension) 0.8650306748466258
In [85]:
tempResultsDf = pd.DataFrame({'Method':['C-005 & linear'], 'Accuracy score with impute raw data(18 dimension)':accrd_005_linear , 'Accuracy score with impute pca data(8 dimension)' :accpca_005_linear })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'Accuracy score with impute raw data(18 dimension)','Accuracy score with impute pca data(8 dimension)']]
resultsDf
Out[85]:
Method Accuracy score with impute raw data(18 dimension) Accuracy score with impute pca data(8 dimension)
0 C-001 & linear 0.895706 0.822086
0 C-005 & linear 0.932515 0.865031

Accuracy has increased on both the raw and the PCA data with the larger C value.

In [86]:
# SVM - C value 0.5 and kernel - Linear
In [87]:
svc = SVC(C=0.5,kernel='linear', degree=3)
In [88]:
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
Out[88]:
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [89]:
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
In [90]:
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
Out[90]:
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [91]:
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
In [92]:
accrd_05_linear =accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict)
accpca_05_linear =accuracy_score(impute_pca_y_test,impute_pca_y_predict)
print(accrd_05_linear)
print(accpca_05_linear)
0.9447852760736196
0.852760736196319
In [93]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.9447852760736196
Accuracy score with impute pca data(8 dimension) 0.852760736196319
In [94]:
tempResultsDf = pd.DataFrame({'Method':['C-05 & linear'], 'Accuracy score with impute raw data(18 dimension)':accrd_05_linear , 'Accuracy score with impute pca data(8 dimension)' :accpca_05_linear })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'Accuracy score with impute raw data(18 dimension)','Accuracy score with impute pca data(8 dimension)']]
resultsDf
Out[94]:
Method Accuracy score with impute raw data(18 dimension) Accuracy score with impute pca data(8 dimension)
0 C-001 & linear 0.895706 0.822086
0 C-005 & linear 0.932515 0.865031
0 C-05 & linear 0.944785 0.852761

Raw-data accuracy has increased with the larger C value, while PCA-data accuracy has dropped slightly.

In [95]:
# SVM - C value 1 and kernel - Linear
In [96]:
svc = SVC(C=1,kernel='linear', degree=3)
In [97]:
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
Out[97]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [98]:
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
In [99]:
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
Out[99]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [100]:
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
In [101]:
accrd_1_linear =accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict)
accpca_1_linear =accuracy_score(impute_pca_y_test,impute_pca_y_predict)
print(accrd_1_linear)
print(accpca_1_linear)
0.9447852760736196
0.852760736196319
In [102]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.9447852760736196
Accuracy score with impute pca data(8 dimension) 0.852760736196319
In [103]:
tempResultsDf = pd.DataFrame({'Method':['C-1 & linear'], 'Accuracy score with impute raw data(18 dimension)':accrd_1_linear , 'Accuracy score with impute pca data(8 dimension)' :accpca_1_linear })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'Accuracy score with impute raw data(18 dimension)','Accuracy score with impute pca data(8 dimension)']]
resultsDf
Out[103]:
Method Accuracy score with impute raw data(18 dimension) Accuracy score with impute pca data(8 dimension)
0 C-001 & linear 0.895706 0.822086
0 C-005 & linear 0.932515 0.865031
0 C-05 & linear 0.944785 0.852761
0 C-1 & linear 0.944785 0.852761

From the table above, C = 0.5 and C = 1 with the linear kernel give identical accuracy on both the raw data and the PCA data. For the PCA data, C = 0.05 with the linear kernel gives the highest accuracy (0.8650).

Now let's change the kernel to rbf.

In [104]:
# SVM - C value 0.01 and kernel - rbf
In [105]:
svc = SVC(C=0.01,kernel='rbf', degree=3)
In [106]:
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[106]:
SVC(C=0.01, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [107]:
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
In [108]:
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[108]:
SVC(C=0.01, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [109]:
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
In [110]:
accrd_001_rbf =accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict)
accpca_001_rbf =accuracy_score(impute_pca_y_test,impute_pca_y_predict)
print(accrd_001_rbf)
print(accpca_001_rbf)
0.5153374233128835
0.5153374233128835
In [111]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.5153374233128835
Accuracy score with impute pca data(8 dimension) 0.5153374233128835
In [112]:
tempResultsDf = pd.DataFrame({'Method':['C-001 & rbf'], 'Accuracy score with impute raw data(18 dimension)':accrd_001_rbf , 'Accuracy score with impute pca data(8 dimension)' :accpca_001_rbf })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'Accuracy score with impute raw data(18 dimension)','Accuracy score with impute pca data(8 dimension)']]
resultsDf
Out[112]:
Method Accuracy score with impute raw data(18 dimension) Accuracy score with impute pca data(8 dimension)
0 C-001 & linear 0.895706 0.822086
0 C-005 & linear 0.932515 0.865031
0 C-05 & linear 0.944785 0.852761
0 C-1 & linear 0.944785 0.852761
0 C-001 & rbf 0.515337 0.515337
In [113]:
# SVM - C value 0.05 and kernel - rbf
In [114]:
svc = SVC(C=0.05,kernel='rbf', degree=3)
In [115]:
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[115]:
SVC(C=0.05, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [116]:
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
In [117]:
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[117]:
SVC(C=0.05, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [118]:
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
In [119]:
accrd_005_rbf =accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict)
accpca_005_rbf =accuracy_score(impute_pca_y_test,impute_pca_y_predict)
print(accrd_005_rbf)
print(accpca_005_rbf)
0.6871165644171779
0.7423312883435583
In [120]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.6871165644171779
Accuracy score with impute pca data(8 dimension) 0.7423312883435583
In [121]:
tempResultsDf = pd.DataFrame({'Method':['C-005 & rbf'], 'Accuracy score with impute raw data(18 dimension)':accrd_005_rbf , 'Accuracy score with impute pca data(8 dimension)' :accpca_005_rbf })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'Accuracy score with impute raw data(18 dimension)','Accuracy score with impute pca data(8 dimension)']]
resultsDf
Out[121]:
Method Accuracy score with impute raw data(18 dimension) Accuracy score with impute pca data(8 dimension)
0 C-001 & linear 0.895706 0.822086
0 C-005 & linear 0.932515 0.865031
0 C-05 & linear 0.944785 0.852761
0 C-1 & linear 0.944785 0.852761
0 C-001 & rbf 0.515337 0.515337
0 C-005 & rbf 0.687117 0.742331
In [122]:
# SVM - C value 0.5 and kernel - rbf
In [123]:
svc = SVC(C=0.5,kernel='rbf', degree=3)
In [124]:
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
Out[124]:
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [125]:
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
In [126]:
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
Out[126]:
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [127]:
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
In [128]:
accrd_05_rbf =accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict)
accpca_05_rbf =accuracy_score(impute_pca_y_test,impute_pca_y_predict)
print(accrd_05_rbf)
print(accpca_05_rbf)
0.9693251533742331
0.950920245398773
In [129]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.9693251533742331
Accuracy score with impute pca data(8 dimension) 0.950920245398773
In [130]:
tempResultsDf = pd.DataFrame({'Method':['C-05 & rbf'], 'Accuracy score with impute raw data(18 dimension)':accrd_05_rbf , 'Accuracy score with impute pca data(8 dimension)' :accpca_05_rbf })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'Accuracy score with impute raw data(18 dimension)','Accuracy score with impute pca data(8 dimension)']]
resultsDf
Out[130]:
Method Accuracy score with impute raw data(18 dimension) Accuracy score with impute pca data(8 dimension)
0 C-001 & linear 0.895706 0.822086
0 C-005 & linear 0.932515 0.865031
0 C-05 & linear 0.944785 0.852761
0 C-1 & linear 0.944785 0.852761
0 C-001 & rbf 0.515337 0.515337
0 C-005 & rbf 0.687117 0.742331
0 C-05 & rbf 0.969325 0.950920
In [131]:
# SVM - C value 1 and kernel - rbf
In [132]:
svc = SVC(C=1,kernel='rbf', degree=3)
In [133]:
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
Out[133]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [134]:
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
In [135]:
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
Out[135]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [136]:
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
In [137]:
accrd_1_rbf =accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict)
accpca_1_rbf =accuracy_score(impute_pca_y_test,impute_pca_y_predict)
print(accrd_1_rbf)
print(accpca_1_rbf)
0.9693251533742331
0.9447852760736196
In [138]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.9693251533742331
Accuracy score with impute pca data(8 dimension) 0.9447852760736196
In [139]:
tempResultsDf = pd.DataFrame({'Method':['C-1 & rbf'], 'Accuracy score with impute raw data(18 dimension)':accrd_1_rbf , 'Accuracy score with impute pca data(8 dimension)' :accpca_1_rbf })
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'Accuracy score with impute raw data(18 dimension)','Accuracy score with impute pca data(8 dimension)']]
resultsDf
Out[139]:
Method Accuracy score with impute raw data(18 dimension) Accuracy score with impute pca data(8 dimension)
0 C-001 & linear 0.895706 0.822086
0 C-005 & linear 0.932515 0.865031
0 C-05 & linear 0.944785 0.852761
0 C-1 & linear 0.944785 0.852761
0 C-001 & rbf 0.515337 0.515337
0 C-005 & rbf 0.687117 0.742331
0 C-05 & rbf 0.969325 0.950920
0 C-1 & rbf 0.969325 0.944785

From the above results we can conclude that C=0.5 with kernel='rbf' is the best hyperparameter combination for the SVM on this data.

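The cell-by-cell sweep above can be automated with `GridSearchCV`, which also performs the cross-validation the project brief asks for. A sketch over the same grid; the training-array names from the earlier cells (`impute_rawdata_X_train`, `impute_rawdata_y_train`) are commented out as assumptions, and placeholder data would be substituted in their place:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Same grid as the manual sweep: C in {0.01, 0.05, 0.5, 1}, linear and rbf kernels.
param_grid = {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear', 'rbf']}

# 5-fold cross-validated grid search; gamma fixed to 'scale' to avoid the
# FutureWarning seen in the cells above.
grid = GridSearchCV(SVC(gamma='scale'), param_grid, cv=5, scoring='accuracy')

# grid.fit(impute_rawdata_X_train, impute_rawdata_y_train)  # names assumed from earlier cells
# grid.best_params_ and grid.best_score_ then give the winning combination
# and its cross-validated accuracy in one step.
```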
In [140]:
#display confusion matrix of both models
print("Confusion matrix with impute raw data(18 dimension)\n",confusion_matrix(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Confusion matrix with impute pca data(8 dimension)\n",confusion_matrix(impute_pca_y_test,impute_pca_y_predict))
Confusion matrix with impute raw data(18 dimension)
 [[83  0  1]
 [ 0 53  0]
 [ 3  1 22]]
Confusion matrix with impute pca data(8 dimension)
 [[82  1  1]
 [ 2 51  0]
 [ 4  1 21]]

Conclusion: From the above we can see that PCA is doing a very good job. Accuracy with PCA is approximately 95%, versus approximately 97% with the raw data, but note that PCA reaches its 95% with only 8 dimensions whereas the raw data uses 18. The trade-off is interpretability: the principal components no longer correspond to the original features, so the PCA model is harder to interpret.

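The "8 dimensions" figure comes from the PCA/elbow step earlier in the notebook, where the reduced dimension was chosen to cover more than 95% of the variance. That check can be sketched as a small helper; the input array here is a placeholder for the scaled, imputed feature matrix used earlier:

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches the threshold."""
    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    return int(np.argmax(cum >= threshold)) + 1
```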
Additional algorithm: Naive Bayes

In [141]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn import metrics
In [142]:
NB_model = GaussianNB()
In [143]:
#fit the model on impute raw data
NB_model.fit(impute_rawdata_X_train,impute_rawdata_y_train)
Out[143]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [144]:
#predict the y value
impute_rawdata_y_predict = NB_model.predict(impute_rawdata_X_test)
In [145]:
#now fit the model on pca data with new dimension
NB_model.fit(impute_pca_X_train,impute_pca_y_train)
Out[145]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [146]:
#predict the y value
impute_pca_y_predict = NB_model.predict(impute_pca_X_test)
In [147]:
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
Accuracy score with impute raw data(18 dimension) 0.6134969325153374
Accuracy score with impute pca data(8 dimension) 0.7852760736196319
In [148]:
print(metrics.classification_report(impute_rawdata_y_test,impute_rawdata_y_predict))
print(metrics.classification_report(impute_pca_y_test,impute_pca_y_predict))
              precision    recall  f1-score   support

           0       0.91      0.76      0.83        84
           1       0.94      0.28      0.43        53
           2       0.27      0.81      0.41        26

   micro avg       0.61      0.61      0.61       163
   macro avg       0.71      0.62      0.56       163
weighted avg       0.82      0.61      0.63       163

              precision    recall  f1-score   support

           0       0.83      0.96      0.89        84
           1       0.89      0.62      0.73        53
           2       0.50      0.54      0.52        26

   micro avg       0.79      0.79      0.79       163
   macro avg       0.74      0.71      0.71       163
weighted avg       0.80      0.79      0.78       163

Naive Bayes gives lower accuracy than the SVM: 61% on the raw data and 79% on the PCA data, versus 97% and 95% for the tuned SVM.

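A single train/test split can flatter or penalise a model by chance; the SVM-vs-Naive-Bayes gap is more convincing when confirmed with cross-validation on the same data. A sketch of that comparison (the real run would pass the imputed feature matrices from the earlier cells, which are assumed here):

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def compare_models(X, y, cv=5):
    """Mean cross-validated accuracy for the tuned SVM and for GaussianNB."""
    svm_acc = cross_val_score(SVC(C=0.5, kernel='rbf', gamma='scale'),
                              X, y, cv=cv).mean()
    nb_acc = cross_val_score(GaussianNB(), X, y, cv=cv).mean()
    return svm_acc, nb_acc
```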
In [ ]: